Provenance for SPARQL queries 



C. V. Damasio 1 and A. Analyti 2 and G. Antoniou 3 

1 CENTRIA, Departamento de Informatica Faculdade de Ciencias e Tecnologia 
Universidade Nova de Lisboa, 2829-516 Caparica, Portugal. 
cd@fct.unl.pt 

2 Institute of Computer Science, FORTH-ICS, Crete, Greece 
analyti@ics.forth.gr
3 Institute of Computer Science, FORTH-ICS, and
Department of Computer Science, University of Crete, Crete, Greece
antoniou@ics.forth.gr



Abstract. Determining trust of data available in the Semantic Web
is fundamental for applications and users, in particular for linked open
data obtained from SPARQL endpoints. There exist several proposals in
the literature to annotate SPARQL query results with values from abstract
models, adapting the seminal works on provenance for annotated
relational databases. We present an approach that provides provenance
information for a large and significant fragment of SPARQL 1.1,
including for the first time the major non-monotonic constructs under
multiset semantics. The approach is based on the translation of SPARQL
into relational queries over annotated relations with values of the most
general m-semiring. In this way, we also refute a claim in the literature
that the OPTIONAL construct of SPARQL cannot be captured appropriately
with the known abstract models.

This document presents the proof of the main result of the paper entitled
"Provenance for SPARQL queries", to appear in Proceedings of ISWC 2012, Boston,
edited by Bernstein, Cudre-Mauroux, Heflin et al., Springer. The original
publication will be available at www.springerlink.com and this document will be
substituted by the authors' version after publication.



A Proof of the main result 

The rationale for obtaining how-provenance for SPARQL is to represent each
solution mapping as a tuple of a relational algebra query constructed from the
original SPARQL graph pattern. The construction is intricate and fully
specified, and is inspired by the translation of full SPARQL 1.0 queries into
SQL, as detailed in [BJ], and into Datalog in [TT]. Here, we follow a similar
strategy but, for simplicity of presentation, we assume that a given RDF dataset
$D = \{G_0, (<u_1>, G_1), (<u_2>, G_2), \ldots, (<u_n>, G_n)\}$ is represented
by the two relations Graphs(gid, IRI) and Quads(gid, sub, pred, obj). The former
stores information about the graphs in the dataset D, where gid is a numeric
graph identifier and IRI an IRI reference. The relation Quads stores the triples
of every graph in the RDF dataset. Different implementations may readily adapt
the translation provided in this section to their own schema.

Relation Graphs(gid, IRI) contains a tuple $(i, <u_i>)$ for each named graph
$(<u_i>, G_i)$, and the tuple $(0, <>)$ for the default graph, while relation
Quads(gid, sub, pred, obj) stores a tuple of the form $(i, s, p, o)$ for each
triple $(s, p, o) \in G_i$. With this encoding, the default graph always has
identifier 0, and all the graph identifiers are consecutive integers.
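
The encoding just described can be illustrated with a minimal Python sketch. The function name `encode_dataset`, the use of Python lists of tuples for relations, and the string "<>" for the IRI of the default graph are assumptions made only for this illustration; they are not part of the construction itself.

```python
def encode_dataset(default_graph, named_graphs):
    """Build Graphs(gid, IRI) and Quads(gid, sub, pred, obj) as lists of tuples.

    default_graph: a set of (s, p, o) triples
    named_graphs:  a list of (iri, set_of_triples) pairs, in a fixed order
    """
    graphs = [(0, "<>")]  # the default graph always gets identifier 0
    quads = [(0, s, p, o) for (s, p, o) in sorted(default_graph)]
    for gid, (iri, triples) in enumerate(named_graphs, start=1):
        graphs.append((gid, iri))  # consecutive identifiers 1..n
        quads.extend((gid, s, p, o) for (s, p, o) in sorted(triples))
    return graphs, quads


default = {("<a>", "<knows>", "<b>")}
named = [("<http://example.org/g1>",
          {("<b>", "<knows>", "<c>"), ("<c>", "<knows>", "<c>")})]
graphs, quads = encode_dataset(default, named)
print(graphs)
print(quads)
```

Since each graph is a set of triples, neither relation contains duplicate tuples, which is the property the proof relies on for the base case.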

We also assume the existence of a special value unb, distinct from the
encoding of any RDF term, to represent that a particular variable is unbound in
the solution mapping. This is required in order to be able to represent solution
mappings as tuples with fixed and known arity. Moreover, we assume that the
variables are totally ordered (e.g. lexicographically). The translation requires
the full power of relational algebra. Notice that bag semantics is assumed
(duplicates are allowed) in order to obey the cardinality restrictions of SPARQL
algebra operators [T].

Theorem 1 (Correctness of translation). Given a graph pattern P and an
RDF dataset D(G), the process of evaluating the query is performed as follows:

1. Construct the base relations Graphs and Quads from D(G);

2. Evaluate

\[
[SPARQL(P, D(G), V)]_{\mathcal{R}} =
\Pi_{V}\left[\sigma_{G'=0}\left([()]^{G'}_{\mathcal{R}} \bowtie [P]^{G'}_{\mathcal{R}}\right)\right]
\]

with respect to the base relations Graphs and Quads, where G' is a new attribute
name and $V \subseteq var(P)$.

Moreover, the tuples of the relational algebra query in step 2 are in one-to-one
correspondence with the solution mappings of $[\![P]\!]_{D(G)}$ when V = var(P),
where an attribute mapped to unb represents that the corresponding variable does
not belong to the domain of the solution mapping.

Proof. The proof by induction constructs an expression containing as attributes
the graph attribute and an attribute for each in-scope variable of the graph
pattern. Assume that the active graph is given and has id j (0 for the default
graph, or $1 \le j \le n$ for the case of a named graph). We show that the
cardinality of solutions of each SPARQL algebra operator is respected for the
active graph, which is the most difficult part. Notice that the graphs by
definition do not have duplicate triples, and therefore there are no duplicates
in the original base relations Graphs and Quads.



4 For simplicity sub, pred, and obj are text attributes storing lexical forms of the
triples' components. We assume that datatype literals have been normalized, and
blank nodes are distinct in each graph. The only constraint is that different RDF
terms must be represented by different strings; this can be easily guaranteed.

Empty graph pattern. Recall that the empty graph pattern is translated into
$[()]^{G}_{\mathcal{R}} = \Pi_{G}\left[\rho_{G \leftarrow gid}(\mathrm{Graphs})\right]$.
This means that the relational algebra query returns as many tuples as there
are graphs in the RDF dataset, and at least one tuple, with the identifier of
the initial default graph. Moreover, note that for each particular graph id
(default or named) this relation has exactly one tuple with that id.

Triple pattern. Since there are no duplicate triples in the graphs, the number
of possible solutions of a triple pattern is exactly the number of triples that
match the pattern, one for each instance. We analyse the correctness of the
translation of the triple pattern according to the number of variables that may
occur in it: no variables, one variable, two variables and three variables.

0 vars: In this case the triple pattern $t = (t_1, t_2, t_3)$ contains only RDF
terms, which might be identical or not. We obtain the relational algebra
expression

\[
\Pi_{G}\left[\rho_{G \leftarrow gid}\left(\sigma_{sub = t_1 \land pred = t_2 \land obj = t_3}(\mathrm{Quads})\right)\right]
\]

The resulting relation has just the attribute (column) G. If the triple occurs
in graph i then the resulting relation will contain a tuple having value i in
attribute G; otherwise no tuple will occur for graph i.

1 var: We have several cases to consider here: there is just one occurrence of
the variable, two or three.

Consider that the variable occurs in the subject of the triple; for the
remaining cases (predicate or object) the reasoning is similar. So, let
$t = (?v, t_2, t_3)$, obtaining the relational algebra expression

\[
\Pi_{G,v}\left[\rho_{G \leftarrow gid,\, v \leftarrow sub}\left(\sigma_{pred = t_2 \land obj = t_3}(\mathrm{Quads})\right)\right]
\]

The resulting relation has two columns, one for the graph G and one for
collecting the bindings for v (i.e. the solution). For each triple of graph i
that matches the pattern, the resulting relation contains a tuple having value
i in attribute G; if no triple of graph i matches, no tuple will occur for
graph i.

Consider now that the variable occurs in the subject and predicate of the
triple; for the remaining cases the reasoning is similar. So, let
$t = (?v, ?v, t_3)$, obtaining the relational algebra expression

\[
\Pi_{G,v}\left[\rho_{G \leftarrow gid,\, v \leftarrow sub}\left(\sigma_{sub = pred \land obj = t_3}(\mathrm{Quads})\right)\right]
\]

The resulting relation has two columns, one for the graph G and one for
collecting the bindings for v (i.e. the solution). The selection condition
guarantees that the obtained instances match the triple pattern, obtaining for
each graph i one tuple for each possible value of v.

If there are three occurrences of the variable ?v, then $t = (?v, ?v, ?v)$ and
one obtains the expression

\[
\Pi_{G,v}\left[\rho_{G \leftarrow gid,\, v \leftarrow sub}\left(\sigma_{sub = pred \land sub = obj \land pred = obj}(\mathrm{Quads})\right)\right]
\]

Note that one of the equalities in the selection condition is redundant. As
before, for each graph, we get a tuple for each possible value of v.

2 vars: There are two variants of patterns here, $t = (?v_1, ?v_2, t_3)$ and
$t = (?v_1, ?v_2, ?v_2)$, and their permutations. For the first pattern we get

\[
\Pi_{G,v_1,v_2}\left[\rho_{G \leftarrow gid,\, v_1 \leftarrow sub,\, v_2 \leftarrow pred}\left(\sigma_{obj = t_3}(\mathrm{Quads})\right)\right]
\]

while for the second we get

\[
\Pi_{G,v_1,v_2}\left[\rho_{G \leftarrow gid,\, v_1 \leftarrow sub,\, v_2 \leftarrow pred}\left(\sigma_{pred = obj}(\mathrm{Quads})\right)\right]
\]

It is easy to see that each solution (for each graph) corresponds exactly to
one tuple in the evaluation of the translated relation.

3 vars: This case is immediate, the triple pattern being
$t = (?v_1, ?v_2, ?v_3)$. The translation is

\[
\Pi_{G,v_1,v_2,v_3}\left[\rho_{G \leftarrow gid,\, v_1 \leftarrow sub,\, v_2 \leftarrow pred,\, v_3 \leftarrow obj}(\mathrm{Quads})\right]
\]

obtaining a relation which is isomorphic to Quads, as expected. For each graph
i, we obtain a tuple with value i in attribute G and remaining attributes
corresponding exactly to one triple in graph i.
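
The four cases above can be covered by a single routine. The following Python sketch is only an illustration: the names `is_var`, `match` and `translate_triple`, the convention that variables are strings starting with "?", and the representation of a relation as a pair (column list, list of row tuples) are all assumptions, not part of the original translation.

```python
def is_var(term):
    return term.startswith("?")


def match(pattern, spo):
    """Return the variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for term, value in zip(pattern, spo):
        if is_var(term):
            # A repeated variable must see the same value at every position,
            # mirroring selection conditions such as sub = pred.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None  # a constant term must equal the stored component
    return binding


def translate_triple(quads, pattern):
    """Evaluate a triple pattern over Quads; one output row per matching triple."""
    variables = sorted({t for t in pattern if is_var(t)})  # total order on variables
    rows = []
    for gid, sub, pred, obj in quads:
        binding = match(pattern, (sub, pred, obj))
        if binding is not None:
            rows.append((gid,) + tuple(binding[v] for v in variables))
    return ["G"] + variables, rows


quads = [(0, "a", "p", "b"), (1, "b", "p", "c"), (1, "c", "c", "c")]
print(translate_triple(quads, ("?v", "p", "?w")))
print(translate_triple(quads, ("?v", "?v", "?v")))
print(translate_triple(quads, ("a", "p", "b")))
```

Because Quads has no duplicates, each graph contributes exactly one row per matching triple, as argued in the case analysis.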

UNION pattern. The translated relational algebra expression
$[(P_1 \ \mathrm{UNION}\ P_2)]^{G}_{\mathcal{R}}$ is:

\[
\Pi_{G,\, var(P_1) \cup \{v \leftarrow unb \mid v \in var(P_2) \setminus var(P_1)\}}\left([P_1]^{G}_{\mathcal{R}}\right)
\ \cup\
\Pi_{G,\, var(P_2) \cup \{v \leftarrow unb \mid v \in var(P_1) \setminus var(P_2)\}}\left([P_2]^{G}_{\mathcal{R}}\right)
\]

The relational algebra expression makes the union of two projections. Neither
projection removes any in-scope variable; each is used to extend the columns
with unbound values in order to obtain a relation with columns
$\{G\} \cup var(P_1) \cup var(P_2)$, by making unbound the variables that do
not occur in the sub-pattern $P_1$ or $P_2$, respectively. Therefore, in each
subexpression the projection operator does not remove any in-scope attribute of
$[P_1]^{G}_{\mathcal{R}}$ or $[P_2]^{G}_{\mathcal{R}}$, and thus it returns
exactly as many tuples as the number of solutions of each sub-pattern, by
induction hypothesis. The cardinality of the resulting expression is the sum of
the solutions of each sub-expression, according to the bag semantics of
relational algebra.
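
As an illustration of the padding with unbound values, here is a small Python sketch. The helper names `pad` and `union`, the string "unb" standing in for the special unbound value, and the (columns, rows) representation of a relation are assumptions of this example.

```python
UNB = "unb"  # stand-in for the special unbound value


def pad(columns, rows, target):
    """Project onto the target columns, setting missing ones to UNB (v <- unb)."""
    index = {c: i for i, c in enumerate(columns)}
    return [tuple(r[index[c]] if c in index else UNB for c in target) for r in rows]


def union(rel1, rel2):
    """Bag union of two pattern relations after aligning their columns."""
    cols1, rows1 = rel1
    cols2, rows2 = rel2
    variables = sorted((set(cols1) | set(cols2)) - {"G"})
    target = ["G"] + variables
    # List concatenation keeps duplicates, so cardinalities add up.
    return target, pad(cols1, rows1, target) + pad(cols2, rows2, target)


p1 = (["G", "?x"], [(0, "a"), (0, "a")])
p2 = (["G", "?y"], [(0, "b")])
print(union(p1, p2))
```

The result has as many rows as the two inputs together, which is the cardinality required for UNION.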

AND pattern. The translated relational algebra expression
$[(P_1 \ \mathrm{AND}\ P_2)]^{G}_{\mathcal{R}}$ is:

\[
\Pi_{G,\ var(P_1) - var(P_2),\ var(P_2) - var(P_1),\ v_1 \leftarrow first(v'_1, v''_1),\ \ldots,\ v_n \leftarrow first(v'_n, v''_n)}
\left[\sigma_{comp}\left(
\rho_{v'_1 \leftarrow v_1, \ldots, v'_n \leftarrow v_n}\left([P_1]^{G}_{\mathcal{R}}\right)
\bowtie
\rho_{v''_1 \leftarrow v_1, \ldots, v''_n \leftarrow v_n}\left([P_2]^{G}_{\mathcal{R}}\right)
\right)\right]
\]

where comp is a conjunction of conditions
$v'_i = unb \lor v''_i = unb \lor v'_i = v''_i$ for each common variable $v_i$
($1 \le i \le n$). The function first returns the first argument which is not
unb, or unb if both arguments are unb.

The joined subexpressions inside the selection have only the common attribute
G, due to the renaming of common variables. So, every tuple of the
subexpression for $[P_1]^{G}_{\mathcal{R}}$ will join with every tuple of
$[P_2]^{G}_{\mathcal{R}}$, for each graph. If there is a solution with
cardinality $c_1$ of $P_1$ and a solution with cardinality $c_2$ of $P_2$, one
will obtain $c_1 \times c_2$ tuples in the result of the joined expression.
From these possibilities, the selection expression keeps the combinations of
solutions which are compatible: for each pair $v'_i$ and $v''_i$ the condition
guarantees that at least one of the two is unbound or that both have the same
value. So, we only keep the tuples for each possible merge (with cardinality
$c_1 \times c_2$, since selection keeps duplicate tuples). The projection is
necessary to obtain the relation on the original variables, besides G,
obtaining the value from the first bound variable, if any. It is also necessary
to recall that the bag semantics for the projection operator will obtain as
many tuples as the contributing tuples, summing over the obtained solutions as
required by the cardinality condition of the AND operator (see the definition
of $\Pi$ for $\mathcal{K}$-relations).
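
The join, the compatibility test and the function first can be sketched in Python as follows. The names `join_and` and `first`, the string "unb" for the unbound value, and the (columns, rows) representation with the graph attribute in column 0 are assumptions of this illustration.

```python
UNB = "unb"  # stand-in for the special unbound value


def first(a, b):
    """Return the first argument that is not UNB, or UNB if both are."""
    return a if a != UNB else b


def join_and(rel1, rel2):
    cols1, rows1 = rel1
    cols2, rows2 = rel2
    vars1, vars2 = cols1[1:], cols2[1:]  # column 0 is the graph attribute G
    common = set(vars1) & set(vars2)
    out_vars = sorted(set(vars1) | set(vars2))
    rows = []
    for r1 in rows1:
        m1 = dict(zip(vars1, r1[1:]))
        for r2 in rows2:
            if r1[0] != r2[0]:
                continue  # natural join on G
            m2 = dict(zip(vars2, r2[1:]))
            # comp: each common variable is unbound on one side or equal on both
            if all(m1[v] == UNB or m2[v] == UNB or m1[v] == m2[v] for v in common):
                merged = tuple(first(m1.get(v, UNB), m2.get(v, UNB)) for v in out_vars)
                rows.append((r1[0],) + merged)
    return ["G"] + out_vars, rows


p1 = (["G", "?x", "?y"], [(0, "a", UNB), (0, "a", UNB)])
p2 = (["G", "?y", "?z"], [(0, "b", "c")] * 3 + [(1, "b", "c")])
print(join_and(p1, p2))
```

In this example a solution of cardinality 2 joins with a compatible solution of cardinality 3 in graph 0, so the merged solution appears 6 times.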

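FILTER pattern. The relational algebra expression
$[(P \ \mathrm{FILTER}\ R)]^{G}_{\mathcal{R}}$ is

\[
\Pi_{G,\, var(P)}\left[\sigma_{filter}\left([P]^{G}_{\mathcal{R}} \bowtie E_1 \bowtie \cdots \bowtie E_m\right)\right]
\]

where filter is a condition obtained from R where each occurrence of
EXISTS($P_i$) (resp. NOT EXISTS($P_i$)) in R is substituted by condition
$ex_i <> 0$ (resp. $ex_i = 0$), where $ex_i$ is a new attribute name.
Expression $E_i$ ($1 \le i \le m$) is:

\[
\begin{array}{l}
\Pi_{G,\, var(P),\, ex_i \leftarrow 0}\left[\delta(P') - \Pi_{G,\, var(P)}\left[\sigma_{subst}\left(P' \bowtie \rho_{v'_1 \leftarrow v_1, \ldots, v'_n \leftarrow v_n}(P'_i)\right)\right]\right] \\
\qquad \cup \\
\Pi_{G,\, var(P),\, ex_i \leftarrow 1}\left[\delta(P') \bowtie \delta\left(\Pi_{G,\, var(P)}\left[\sigma_{subst}\left(P' \bowtie \rho_{v'_1 \leftarrow v_1, \ldots, v'_n \leftarrow v_n}(P'_i)\right)\right]\right)\right]
\end{array}
\]

where $P' = [P]^{G}_{\mathcal{R}}$, $P'_i = [P_i]^{G}_{\mathcal{R}}$, and subst
is the conjunction of conditions $v_i = v'_i \lor v_i = unb$ for each variable
$v_i$ in $var(P) \cap var(P_i) = \{v_1, \ldots, v_n\}$.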

The translation is complex due to the EXISTS expressions. Note that each
$E_i$ expression returns exactly one tuple for each solution of pattern P. Thus,
the duplicate removal operations are there just to guarantee this and do not
affect the cardinality of the solutions, obtaining one solution for each solution
of P that obeys the filter condition. The rest of the translation is more or
less immediate, where the condition subst does not correspond exactly to the
compatibility condition used before, since according to the SPARQL semantics
the variables of pattern P are substituted in the EXISTS pattern (we discard the
cases where a variable is bound by P and not bound in $P_i$).
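
The role of the $ex_i$ attribute can be illustrated with a Python sketch for a filter consisting of a single EXISTS or NOT EXISTS. The names `holds_subst`, `exists_flags` and `filter_exists`, the string "unb", and the (columns, rows) representation are assumptions of this example, which computes the flag directly rather than through the union of the two projections.

```python
UNB = "unb"  # stand-in for the special unbound value


def holds_subst(m, m_i, common):
    """subst: same graph, and each common variable is equal or unbound in P."""
    return m["G"] == m_i["G"] and all(m[v] == m_i[v] or m[v] == UNB for v in common)


def exists_flags(rel_p, rel_pi):
    """Sketch of E_i: exactly one flag (0 or 1) per distinct tuple of P."""
    cols_p, rows_p = rel_p
    cols_i, rows_i = rel_pi
    common = [v for v in cols_p[1:] if v in cols_i[1:]]
    flags = {}
    for r in set(rows_p):  # duplicate elimination, as delta(P') does
        m = dict(zip(cols_p, r))
        found = any(holds_subst(m, dict(zip(cols_i, r_i)), common) for r_i in rows_i)
        flags[r] = 1 if found else 0
    return flags


def filter_exists(rel_p, rel_pi, negate=False):
    """P FILTER EXISTS(P_i), or P FILTER NOT EXISTS(P_i) when negate is True."""
    cols_p, rows_p = rel_p
    flags = exists_flags(rel_p, rel_pi)
    wanted = 0 if negate else 1
    # Joining P with E_i keeps the multiplicity of P, since E_i has one row per solution.
    return cols_p, [r for r in rows_p if flags[r] == wanted]


p = (["G", "?x"], [(0, "a"), (0, "a"), (0, "b")])
p_i = (["G", "?x", "?y"], [(0, "a", "1"), (0, "a", "2")])
print(filter_exists(p, p_i))
print(filter_exists(p, p_i, negate=True))
```

Although the solution with ?x = a has two witnesses in $P_i$, it is returned with its original cardinality 2, which is the point of the duplicate removal.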



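MINUS pattern. The relational algebra expression
$[(P_1 \ \mathrm{MINUS}\ P_2)]^{G}_{\mathcal{R}}$ is

\[
[P_1]^{G}_{\mathcal{R}} \bowtie
\left[\delta\left([P_1]^{G}_{\mathcal{R}}\right) -
\Pi_{G,\, var(P_1)}\left[\sigma_{comp \land \lnot disj}\left(
[P_1]^{G}_{\mathcal{R}} \bowtie \rho_{v'_1 \leftarrow v_1, \ldots, v'_n \leftarrow v_n}\left([P_2]^{G}_{\mathcal{R}}\right)
\right)\right]\right]
\]

where comp is a conjunction of conditions
$v_i = unb \lor v'_i = unb \lor v_i = v'_i$ for each common variable $v_i$
($1 \le i \le n$), and disj is the conjunction of conditions
$v_i = unb \lor v'_i = unb$ for each variable $v_i$ ($1 \le i \le n$).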

The difference expression will return either a tuple of
$[P_1]^{G}_{\mathcal{R}}$ or not, without duplicates. Its evaluation returns the
solutions of $P_1$ that do not belong to the expression on the right-hand side
of the difference. Since the right-hand side obtains the tuples corresponding to
the solutions of $P_1$ for which there is at least one solution of $P_2$ that is
compatible and not disjoint (two solutions being disjoint when they have no
bound variable in common), we obtain as result the tuples corresponding to
solutions of $P_1$ such that, for every solution of $P_2$, the two solutions are
incompatible or disjoint (the semantics of MINUS). Now, the difference
expression has no duplicates, and thus we keep only the solutions of $P_1$ that
join with it (i.e. that obey the condition), without increasing the cardinality
of the result.
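
The argument can be checked on small inputs with the following Python sketch. The function name `minus`, the string "unb", and the (columns, rows) representation are assumptions of this illustration.

```python
UNB = "unb"  # stand-in for the special unbound value


def minus(rel1, rel2):
    cols1, rows1 = rel1
    cols2, rows2 = rel2
    common = [v for v in cols1[1:] if v in cols2[1:]]

    def excluded(r1):
        """True if some solution of P2 is compatible with r1 and not disjoint."""
        m1 = dict(zip(cols1, r1))
        for r2 in rows2:
            m2 = dict(zip(cols2, r2))
            if m1["G"] != m2["G"]:
                continue  # the join is on the graph attribute G
            comp = all(m1[v] == UNB or m2[v] == UNB or m1[v] == m2[v] for v in common)
            disj = all(m1[v] == UNB or m2[v] == UNB for v in common)
            if comp and not disj:
                return True
        return False

    # Difference over the duplicate-free version of P1 ...
    survivors = {r for r in set(rows1) if not excluded(r)}
    # ... joined back with P1, which restores the original multiplicities.
    return cols1, [r for r in rows1 if r in survivors]


p1 = (["G", "?x"], [(0, "a"), (0, "b"), (0, "b"), (0, UNB)])
p2 = (["G", "?x"], [(0, "a"), (0, "a")])
print(minus(p1, p2))
```

The solution with ?x = b keeps its cardinality 2, and the solution with ?x unbound survives because it is disjoint from every solution of the second pattern.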

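OPTIONAL pattern. The relational algebra expression
$[(P_1 \ \mathrm{OPTIONAL}\ (P_2 \ \mathrm{FILTER}\ R))]^{G}_{\mathcal{R}}$ is

\[
\begin{array}{l}
[(P_1 \ \mathrm{AND}\ P_2)]^{G}_{\mathcal{R}} \\
\qquad \cup \\
\Pi_{G,\, var(P_1) \cup \{v \leftarrow unb \mid v \in var(P_2) \setminus var(P_1)\}}
\left[[P_1]^{G}_{\mathcal{R}} \bowtie \left(\delta\left([P_1]^{G}_{\mathcal{R}}\right) -
\Pi_{G,\, var(P_1)}\left([(P_1 \ \mathrm{AND}\ P_2) \ \mathrm{FILTER}\ R]^{G}_{\mathcal{R}}\right)\right)\right]
\end{array}
\]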



According to the semantics of SPARQL 1.1, the OPTIONAL pattern evaluation is
performed by two operators: a join and a left join. This is particularly clear in
the translation, where the join is the first expression and the left join the
lower (big) expression below the union operator. The rationale of the
translation of the left join operator is identical to that of the MINUS pattern,
except that now we need to make unbound the variables in $P_2$ but not in
$P_1$. Again, the number of obtained tuples is according to the semantics of
SPARQL 1.1: it is the number of tuples of the join plus the number of tuples of
the left join (guaranteed by the bag semantics of $\cup$).
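
A Python sketch of the two branches follows, under the simplifying assumption that the filter R is trivially true (a plain OPTIONAL). The names `join_and` and `optional`, the string "unb", and the (columns, rows) representation are assumptions of this illustration.

```python
UNB = "unb"  # stand-in for the special unbound value


def join_and(rel1, rel2):
    """Join of compatible solutions; the columns of rel1 come first in the output."""
    cols1, rows1 = rel1
    cols2, rows2 = rel2
    out = cols1 + [c for c in cols2 if c not in cols1]
    rows = []
    for r1 in rows1:
        m1 = dict(zip(cols1, r1))
        for r2 in rows2:
            m2 = dict(zip(cols2, r2))
            shared = m1.keys() & m2.keys()  # includes the graph attribute G
            if all(m1[c] == UNB or m2[c] == UNB or m1[c] == m2[c] for c in shared):
                # Take the first bound value, preferring the left operand.
                merged = {**m2, **{c: v for c, v in m1.items() if v != UNB}}
                rows.append(tuple(merged.get(c, UNB) for c in out))
    return out, rows


def optional(rel1, rel2):
    """(P1 OPTIONAL P2), assuming the filter R is trivially true."""
    cols1, rows1 = rel1
    out, joined = join_and(rel1, rel2)
    # Projection of the join onto G and var(P1); cols1 is a prefix of out.
    matched = {row[:len(cols1)] for row in joined}
    unmatched = set(rows1) - matched         # duplicate-free difference
    pad = (UNB,) * (len(out) - len(cols1))   # v <- unb for variables only in P2
    left = [r + pad for r in rows1 if r in unmatched]  # join back with P1
    return out, joined + left


p1 = (["G", "?x"], [(0, "a"), (0, "b"), (0, "b")])
p2 = (["G", "?x", "?y"], [(0, "a", "1"), (0, "a", "2")])
print(optional(p1, p2))
```

The output is the concatenation of the join branch and the padded left-join branch, so the total number of tuples is the sum of the two.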
GRAPH pattern. The translation of (GRAPH term $P_1$) has two cases:

- If term is an IRI then $[(\mathrm{GRAPH}\ term\ P_1)]^{G}_{\mathcal{R}}$ is

\[
[()]^{G}_{\mathcal{R}} \bowtie \Pi_{var(P_1)}\left[
\Pi_{G'}\left(\rho_{G' \leftarrow gid}\left(\sigma_{term = IRI}(\mathrm{Graphs})\right)\right)
\bowtie [P_1]^{G'}_{\mathcal{R}}\right]
\]

The selection expression obtains a single tuple containing the identifier
corresponding to the term, or obtains the empty relation if there is no named
graph with that IRI. This tuple joins with $[P_1]^{G'}_{\mathcal{R}}$ to limit
the results to the intended graph. Moreover, the use of a renamed attribute
for the graph is necessary in order to avoid clashes with the attribute
corresponding to the active graph. Furthermore, the graph pattern returns the
same results independently of the active graph, and this is captured by the
join with the empty graph pattern.

- If term is a variable v then $[(\mathrm{GRAPH}\ term\ P_1)]^{G}_{\mathcal{R}}$ is

\[
[()]^{G}_{\mathcal{R}} \bowtie \Pi_{\{v\} \cup var(P_1)}\left[
\rho_{G' \leftarrow gid,\, v \leftarrow IRI}\left(\sigma_{gid > 0}(\mathrm{Graphs})\right)
\bowtie [P_1]^{G'}_{\mathcal{R}}\right]
\]

This case is a little more complex, because now we consider all the named
graphs (gid > 0), and bind the variable v to the corresponding IRIs. The
rationale of the construction is the same as in the previous case.
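
Both cases can be sketched in Python as follows. The function name `graph_pattern`, the convention that a term starting with "?" is a variable, the (columns, rows) representation with the graph attribute in column 0, and the assumption that the variable v does not also occur in $P_1$ are all choices made only for this illustration.

```python
def graph_pattern(graphs, rel_p1, term):
    """Sketch of (GRAPH term P1); assumes a variable term does not occur in P1."""
    cols, rows = rel_p1  # column 0 holds the graph id of each solution (G')
    is_var = term.startswith("?")
    if is_var:
        selected = {gid: iri for gid, iri in graphs if gid > 0}      # named graphs only
    else:
        selected = {gid: iri for gid, iri in graphs if iri == term}  # the graph with that IRI
    inner = []
    for row in rows:
        if row[0] in selected:  # join on the renamed graph attribute G'
            prefix = (selected[row[0]],) if is_var else ()
            inner.append(prefix + row[1:])  # G' is projected away
    out_cols = ([term] if is_var else []) + cols[1:]
    # Join with the empty graph pattern: same result for every active graph.
    return ["G"] + out_cols, [(gid,) + r for gid, _ in graphs for r in inner]


graphs = [(0, "<>"), (1, "<g1>"), (2, "<g2>")]
p1 = (["G", "?x"], [(0, "a"), (1, "b"), (2, "c"), (2, "c")])
print(graph_pattern(graphs, p1, "<g2>"))
print(graph_pattern(graphs, p1, "?g"))
```

Restricting the output to a given value of G shows that the solutions do not depend on the active graph.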

To conclude the proof, we just need to analyse the results of the expression
corresponding to the full query

\[
[SPARQL(P, D(G), V)]_{\mathcal{R}} =
\Pi_{V}\left[\sigma_{G'=0}\left([()]^{G'}_{\mathcal{R}} \bowtie [P]^{G'}_{\mathcal{R}}\right)\right]
\]

with respect to the base relations Graphs and Quads, where G' is a new attribute
name and $V \subseteq var(P)$. The selection starts the evaluation at the
default graph (G' = 0) and the projection keeps the selected variables. The
correctness of the translation is now immediate due to the previous induction.
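
The final step amounts to a selection on the graph attribute followed by a bag projection, as in this minimal Python sketch. The function name `sparql_result` and the (columns, rows) representation with the graph attribute in column 0 are assumptions of this illustration.

```python
def sparql_result(rel_p, selected_vars):
    """Select the default graph (graph id 0) and project onto the chosen variables."""
    cols, rows = rel_p
    idx = [cols.index(v) for v in selected_vars]
    # A list is used, so the projection keeps duplicates (bag semantics).
    return selected_vars, [tuple(r[i] for i in idx) for r in rows if r[0] == 0]


rel = (["G", "?x", "?y"], [(0, "a", "b"), (0, "a", "c"), (1, "d", "e")])
print(sparql_result(rel, ["?x"]))
```

The two solutions from the default graph both survive the projection onto ?x, while the solution from graph 1 is discarded.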